Imputation algorithm for hybrid information system of incomplete data analysis approach based on rough set theory
PENG Li, ZHANG Haiqing, LI Daiwei, TANG Dan, YU Xi, HE Lei
Journal of Computer Applications    2021, 41 (3): 677-685.   DOI: 10.11772/j.issn.1001-9081.2020060894
Concerning the poor imputation capability of the ROUgh Set Theory based Incomplete Data Analysis Approach (ROUSTIDA) for the Hybrid Information System (HIS), which in real-world applications contains multiple attribute types such as discrete (e.g., integer, string, and enumeration), continuous (e.g., floating-point) and missing attributes, a Rough Set Theory based Hybrid Information System for Missing Data Imputation Approach (RSHISMIS) was proposed. Firstly, following the idea of decision attribute equivalence class partition, the HIS was divided to avoid the decision rule conflicts that might occur after imputation. Secondly, a hybrid distance matrix was defined to reasonably quantify the similarity between objects, so as to filter the samples with imputation capability and to overcome ROUSTIDA's inability to handle continuous attributes. Thirdly, the nearest-neighbor idea was incorporated to solve the problem that ROUSTIDA cannot impute data with the same missing attribute when the attribute values of non-discriminant objects conflict. Finally, experiments were conducted on 10 UCI datasets, comparing the proposed method with classical algorithms including ROUSTIDA, K Nearest Neighbor Imputation (KNNI), Random Forest Imputation (RFI), and Matrix Factorization (MF). Experimental results show that the proposed method outperforms ROUSTIDA by 81% in recall on average and by 5% to 53% in precision, and reduces the Normalized Root Mean Square Error (NRMSE) by up to 0.12 compared with ROUSTIDA. Besides, the classification accuracy of the method is 7% higher on average than that of ROUSTIDA, and is also better than those of the imputation algorithms KNNI, RFI and MF.
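The abstract's hybrid distance matrix is not specified in detail here; a minimal sketch of one common way to quantify similarity over mixed attributes is a Gower-style distance (range-normalized difference for numeric attributes, 0/1 mismatch for discrete ones, missing values skipped). The function name and conventions below are illustrative assumptions, not the paper's exact definition.

```python
import math

def hybrid_distance(x, y, kinds, ranges):
    """Gower-style distance over mixed attributes.

    kinds[i] is 'num' or 'cat'; ranges[i] is the value range of
    numeric attribute i (None for discrete attributes); a None
    value in x or y marks a missing attribute, which is skipped.
    """
    total, used = 0.0, 0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if xi is None or yi is None:
            continue  # missing attributes do not contribute
        if kind == 'num':
            total += abs(xi - yi) / rng  # range-normalized to [0, 1]
        else:
            total += 0.0 if xi == yi else 1.0  # simple mismatch
        used += 1
    # no comparable attributes -> objects are maximally dissimilar
    return total / used if used else math.inf
```

A nearest complete object under such a distance would then supply candidate values for imputation, in the spirit of the nearest-neighbor step described above.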
Erasure code with low recovery-overhead in distributed storage systems
ZHANG Hang, LIU Shanzheng, TANG Dan, CAI Hongliang
Journal of Computer Applications    2020, 40 (10): 2942-2950.   DOI: 10.11772/j.issn.1001-9081.2020010127
Erasure code technology is a typical data fault tolerance method in distributed storage systems; compared with multi-copy technology, it can provide high data reliability with low storage overhead. However, high repair cost limits its practical application. Aiming at the high repair cost, complex encoding and poor flexibility of existing erasure codes, a simply-encoded erasure code with low repair cost, Rotation Group Repairable Code (RGRC), was proposed. In RGRC, multiple strips were first combined into a strip set; then the association relationship between the strips was used to hierarchically rotate and encode the data blocks in the strip set to obtain the corresponding redundant blocks. RGRC greatly reduced the amount of data to be read and transmitted during single-node repair, saving substantial network bandwidth, while still retaining high fault tolerance. Moreover, to meet the different needs of distributed storage systems, RGRC can flexibly trade off storage overhead against repair cost. Comparison experiments conducted on a distributed storage system show that, compared with RS (Reed-Solomon) codes, LRC (Locally Repairable Codes), basic-Pyramid, DLRC (Dynamic Local Reconstruction Codes), pLRC (proactive Locally Repairable Codes), GRC (Group Repairable Codes) and UFP-LRC (Unequal Failure Protection based Local Reconstruction Codes), RGRC reduces the cost of single-node repair by 14%-61% at the price of a small amount of additional storage overhead, and reduces the repair time by 14%-58%.
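RGRC's exact construction is not reproduced in this abstract. The underlying trade it describes (extra parity in exchange for cheaper single-node repair) can be illustrated with a generic local-parity sketch in the style of LRC, not RGRC itself: each local group carries its own XOR parity, so repairing one lost block reads only that group instead of the whole stripe.

```python
from functools import reduce

def xor(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A toy stripe of 4 data blocks split into 2 local groups, each with
# its own XOR parity (an LRC-style illustration, not the RGRC layout).
data = [b'\x01', b'\x02', b'\x03', b'\x04']
groups = [data[:2], data[2:]]
parities = [xor(g) for g in groups]

# Repairing a lost block needs only the surviving members of its
# local group plus the local parity, not the full stripe.
lost = groups[0][1]
repaired = xor([groups[0][0], parities[0]])
assert repaired == lost
```

The repair-cost saving comes precisely from this locality: fewer blocks read and transmitted per single-node failure, paid for by the storage overhead of the additional parity blocks.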
Array erasure codes based on coding chains with multiple slopes
TANG Dan, YANG Haopeng, WANG Fuchao
Journal of Computer Applications    2017, 37 (4): 936-940.   DOI: 10.11772/j.issn.1001-9081.2017.04.0936
In view of the low fault tolerance capability and the strong constraint conditions in the construction of most existing array erasure codes, a new type of array erasure code based on coding chains was proposed. In the new code, coding chains with different slopes were used to organize the relationship among data elements and check elements, so as to achieve theoretically unlimited fault tolerance; strong constraint conditions such as the prime-number limitation were avoided in the construction, making the code easy to apply and extend. Simulation results show that, compared with Reed-Solomon (RS) codes, the efficiency of the proposed array erasure codes based on coding chains is more than two orders of magnitude higher; under fixed fault tolerance, their storage efficiency improves as the strip size increases. In addition, the update penalty and repair cost of the array codes are fixed constants, which do not grow with the scale of the storage system or the fault tolerance capability.
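The notion of a coding chain with a slope can be sketched as XOR parity accumulated along lines of a given slope through a data array, in the style of diagonal-parity array codes; the function below is an illustrative simplification under that assumption, not the paper's exact construction.

```python
def slope_parity(array, slope):
    """XOR parity along chains of a given slope in an r x n data array.

    Slope 0 produces column parities; slope 1 produces wrapped
    diagonal parities, and so on. Different slopes give independent
    chains, which is what lets multiple erasures be tolerated.
    """
    r, n = len(array), len(array[0])
    parity = [0] * n
    for i in range(r):
        for j in range(n):
            # element (i, j) belongs to the chain that wraps with
            # horizontal shift `slope` per row
            parity[(j + slope * i) % n] ^= array[i][j]
    return parity
```

Adding one more slope adds one more independent set of parity chains, which is how fault tolerance can grow without a prime-number constraint on the array dimensions in this simplified picture.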
Cascaded and low-consuming online method for large-scale Web page category acquisition
WANG Yaqiang, TANG Ming, ZENG Qin, TANG Dan, SHU Hongping
Journal of Computer Applications    2017, 37 (4): 924-927.   DOI: 10.11772/j.issn.1001-9081.2017.04.0924
To balance accuracy against resource cost when constructing an automatic system for collecting massive well-classified Web pages, a cascaded and low-consuming online method for large-scale Web page category acquisition was proposed, which uses a cascaded strategy to integrate online and offline Web page classifiers so as to take full advantage of both. An online Web page classifier trained on anchor-text features was used as the first-level classifier, and the confidence of its classification results was computed from the information entropy of the posterior probability. The second-level classifier was triggered when this confidence measure exceeded a predefined threshold obtained by Multi-Objective Particle Swarm Optimization (MOPSO); features were then extracted from the downloaded Web pages and classified by an offline classifier pre-trained on Web pages. In comparison experiments with single online classification and single offline classification, the proposed method increased the F1 measure of classification by 10.85% and 4.57% respectively. Moreover, compared with single online classification, its efficiency decreased by less than 30%, while it improved by about 70% over single offline classification. The results demonstrate that the proposed method not only has stronger classification ability, but also significantly reduces computing overhead and bandwidth consumption.
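The cascade's gating step can be sketched as follows, assuming the entropy of the online classifier's posterior serves as the uncertainty measure (high entropy meaning low confidence) and that the MOPSO-tuned threshold is already given; `offline_classifier` and `page` are hypothetical placeholders, not names from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a posterior distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def classify(anchor_probs, threshold, offline_classifier, page):
    """Cascade: accept the cheap online prediction when its posterior
    is concentrated; otherwise download the page and fall back to the
    slower offline classifier."""
    if entropy(anchor_probs) <= threshold:
        # confident online result: return the most probable class index
        return max(range(len(anchor_probs)), key=anchor_probs.__getitem__)
    return offline_classifier(page)
```

Only uncertain pages pay the cost of being downloaded and re-classified, which is where the reported savings in computing overhead and bandwidth come from.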